fev sex sex_male
1 1.708 F 0
2 1.724 F 0
3 1.720 F 0
4 1.558 M 1
5 1.895 M 1
6 2.336 F 0
2024-10-01
fev sex sex_male
1 1.708 F 0
2 1.724 F 0
3 1.720 F 0
4 1.558 M 1
5 1.895 M 1
6 2.336 F 0
Call:
lm(formula = fev ~ sex_male, data = fev_a)
Coefficients:
(Intercept) sex_male
2.4512 0.3613
The estimated average fev value is 2.45 liters for females. The estimated average fev value is 0.36 liters larger for males.
Call:
lm(formula = fev ~ sex_female, data = fev_b)
Coefficients:
(Intercept) sex_female
2.8124 -0.3613
data_dictionary: fev (.csv, sas7bdat, .sav, .txt)
copyright: |
The author of the jse article holds the copyright, but does not list conditions under which it can be used. Individual use for educational purposes is probably permitted under the Fair Use provisions of U.S. Copyright laws.
description: |
Forced Expiratory Volume (FEV) in children. The data was collected in Boston in the 1970s.
additional_description: https://jse.amstat.org/v13n2/datasets.kahn.html
download_url: https://www.amstat.org/publications/jse/datasets/fev.dat.txt
format:
csv: comma delimited
sas7bdat: proprietary (SAS)
sav: proprietary (SPSS)
txt: fixed width
varnames: not included
missing_value_code: not needed
size:
rows: 654
columns: 5
age:
scale: ratio
range: positive integer
unit: years
fev:
label: Forced Expiratory Volume
scale: ratio
range: positive real
unit: liters
ht:
label: Height
scale: positive real
unit: inches
sex:
value:
F: Female
M: Male
smoke:
value:
'FALSE': Nonsmoker
'TRUE': Smoker
---
title: "Linear regression modules using the fev dataset"
author: "Steve Simon"
format:
html:
embed-resources: true
date: 2024-09-25
---
There is a [data dictionary][dd] that provides more details about the data. The program was written by Steve Simon on 2024-09-02 and is placed in the public domain.
[dd]: https://github.com/pmean/datasets/blob/master/fev.yaml
## Libraries
You should always load the tidyverse library. The broom library provides the glance, tidy, and augment functions that help you with computations of linear regression models. The car library provides the vif function for measuing collinearity.
```{r setup}
#| message: false
#| warning: false
library(broom)
library(car)
library(tidyverse)
```
## List variable names
Since the variable names are not listed in the data file itself, you need to list them here.
```{r names}
pulmonary_names <- c(
"age",
"fev",
"height",
"sex",
"smoke")
```
## Reading the data
Here is the code to read the data and show a glimpse.
```{r read}
pulmonary <- read_csv(
file="../data/fev.csv",
col_names=pulmonary_names,
col_types="nnncc")
glimpse(pulmonary)
```
## m1: Linear regression model using sex to predict fev
Is there a relationship between sex and fev? Do males tend to have larger fev values than females. This section (labelled m1) shows some simple descriptive and graphical summaries followed by a linear regression model.
## m1: Descriptive stastics for sex
```{r sex}
pulmonary |>
count(sex) |>
mutate(total=sum(n)) |>
mutate(pct=100*n/total)
```
There are slightly more males (51%) than females in this sample.
## m1: Descriptive statistics for fev
```{r fev}
pulmonary |>
summarize(
fev_mn=mean(fev),
fev_sd=sd(fev),
fev_min=min(fev),
fev_max=max(fev))
```
The average fev value, 2.6, seems reasonable. The standard deviation, 0.87, indicates a fair amount of variation. The minimum and maximum values both appear to be reasonable.
## m1: Tabular summary of the relationship between sex and fev, 1
```{r sex-and-fev-1}
pulmonary |>
group_by(sex) |>
summarize(
fev_mn=mean(fev),
fev_sd=sd(fev))
```
The average fev is 0.36 liters higher for males than females. There is very slightly more variation in the males (the standard deviation is 1.00 versus 0.65).
## m1: Graphical summary of the relationship between sex and fev, 2
```{r sex-and-fev-2}
pulmonary |>
ggplot(aes(sex, fev)) +
geom_boxplot() +
coord_flip() +
ggtitle("Graph drawn by Steve Simon on 2024-09-26") +
xlab("Sex") +
ylab("Forced Expiratory Volume (liters)")
```
The boxplot shows the same pattern slightly larger fev values for males compared to females and slightly more variation.
## m1: Fit the linear regression model
```{r m1-model}
m1 <- lm(fev ~ sex, data=pulmonary)
m1
```
The estimated average fev is 2.45 for females and 0.36 inches higher for males.
## m1: Analysis of variance table
```{r m1-anova}
anova(m1)
```
The F-statistic, 29.6, is large, and the p-value is very small (< 0.001). Reject the null hypothesis and conclude that the average fev values are different between males and females.
## m1: R-squared
```{r m1-r-squared}
glance(m1)$r.squared
```
Sex is a very weak predictor of fev. Only 4% of the variation in fev values can be accounted for by a patient's sex.
## m1: Confidence intervals
```{r m1-ci}
confint(m1)
```
We are 95% confident that the difference in fev values is between 0.23 and 0.49. This is a positive difference for all values in the confidence interval, demonstrating that the average fev values are larger for males than for females. This interval is narrow indicating that there is very little sampling error. Hardly a surprise with such a large dataset (n=654).
## m1: T-tests
```{r m1-t-tests}
tidy(m1)
```
The t-statistic, 5.4, is not close to zero. Conclude that there is a difference in average fev values between males and females. Since the t-statistic is positive, conclude also that the average fev value is larger for males.
## m1: Normal probability plot of residuals, 1
Note: I learn something new everyday. You do not have to use qqnorm to create a normal probability plot. You can do it using the ggplot and stat_qq functions. This looks nicer, is consistent with other visualizations in R, and allows you to put in a title using ggtitle. In your homework, you are welcome to use this approach (ggplot and stat_qq) or you can use qqnorm.
## m1: Normal probability plot of residuals, 2
```{r m1-qq-plot}
r1 <- augment(m1)
r1 |>
ggplot(aes(sample=.resid)) +
stat_qq() +
ggtitle("Graph drawn by Steve Simon on 2024-09-30") +
```
The straight lines indicates that the residuals are normally distributed.
## m1: Histogram of residuals
```{r m1-histogram}
r1 |>
ggplot(aes(.resid)) +
geom_histogram(
binwidth=0.2,
color="black",
fill="white") +
ggtitle("Graph drawn by Steve Simon on 2024-09-26") +
xlab("Residuals from m1 regression model")
```
The histogram also indicates that the residuals are normally distributed.
## m1: Influential data points
Both leverage and Cook's distance make little sense for a regression model using a categorical independent variable. The studentized deleted residual is still useful.
```{r m1-studentized-deleted-residual}
r1 |>
filter(abs(.std.resid) > 3)
```
There are three outliers on the high end, corresponding to fev values of 5.6, 5.6, and 5.8 liters in males. There are no outliers among the female patients.
## m2: Linear regression model using smoke to predict fev
Is there a relationship between smoke and fev? This section (labeled m2) shows some simple descriptive and graphical summaries followed by a linear regression model.
## m2: Descriptive statistics for smoke
```{r smoke}
pulmonary |>
count(smoke) |>
mutate(total=sum(n)) |>
mutate(pct=100*n/total)
```
There are very few smokers (65 or 10%) in this sample. Descriptive statistics for fev were shown earlier.
## m2: Relationship between smoke and fev, 1
```{r smoke-and-fev-1}
pulmonary |>
group_by(smoke) |>
summarize(
fev_mn=mean(fev),
fev_sd=sd(fev))
```
Smokers have an average fev value that is 0.7 units higher than non-smokers. The standard deviations (0.85 and 0.75) demonstrate roughly the same amount of variation in the two groups.
## m2: Relationship between smoke and fev, 2
```{r smoke-and-fev-2}
pulmonary |>
ggplot(aes(smoke, fev)) +
geom_boxplot() +
coord_flip() +
ggtitle("Graph drawn by Steve Simon on 2024-09-26") +
xlab("Did the patient smoke?") +
ylab("Forced Expiratory Volume (liters)")
```
The boxplot shows the same pattern as noted above.
## m2: Fit the linear regression model
```{r m2-model}
m2 <- lm(fev ~ smoke, data=pulmonary)
m2
```
The estimated average fev is 2.57 liters for non-smokers and 0.71 liters higher on average for smokers.
Normally, you would follow this up with various functions (anova, confint, tidy), assess various diagnostic plots using the residuals, and identify influential data points.
## m3: Linear regression model using age and height to predict fev
Previous analysis has shown that age by itself is a strong predictor of fev and height by itself is a strong predictor of fev. A multiple linear regression model including both age and height should do an even better job in predicting fev. This model will also allow you to compare the relative importance of the two variables. Is fev most strongly associated with how big a child is or how old that child is?
You should always start with descriptive statistics and graphs. With two or more independent variables, you should also examine the correlations between the independent variables and their correlations with the dependent variable.
## m3: Descriptive statistics for age
```{r age}
pulmonary |>
summarize(
age_mn=mean(age),
age_sd=sd(age),
age_min=min(age),
age_max=max(age))
```
The descriptive statistics are consistent with a pediatric study. The average age is 9.9 years. The standard deviation, 3.0, shows a large amount of variation in age (large at least for a pediatric study). The range, 3 years to 19 years, also demonstrates a large amount of variation.
## m3: Descriptive statistics for height
```{r height}
pulmonary |>
summarize(
ht_mn=mean(height),
ht_sd=sd(height),
ht_min=min(height),
ht_max=max(height))
```
The average height, 61 inches, is reasonable for a group of children with an average age of around 10 years. The standard deviation, 5.7 inches, and the range, 46 inches to 74 inches, show a moderate amount of variation.
## m3: Correlations
Reduce the pulmonary data frame to just the first three columns before computing correlations.
```{r corr}
pulmonary |>
select(height, age, fev) |>
cor()
```
The two independent variables, age and ht, are both strongly correlated with fev (r=0.76 and 0.87). They are also strongly correlated with one another (r=0.79).
## m3: Predicting fev using age and ht
```{r m3}
m3 <- lm(fev ~ age + height, data=pulmonary)
m3
```
The estimated average fev value increases by 0.05 liters for each increase of one year in age, holding height constant. The estimated average fev value increases by 0.11 liters for every increase of one inch in height. The estimated average fev is -4.6 for a patient of age zero with a height of zero inches. This is clearly and extrapolation beyond the range of the data.
## m3: Confidence intervals, 1
The use of [2, ] (and of [3, ]) isolates the individual rows of the data frame produced by confint. This is not really necessary, but is done to fit the information easily on separate slides.
```{r m3-ci-1}
m3
confint(m3)[2, ]
```
We are 95% confident that the estimated average fev value increases between 0.036 and 0.072 liters for each increase of one year of age holding height constant. Conclude that there is a positive relationship between fev and age.
## m3: Confidence intervals, 2
```{r m3-ci-2}
confint(m3)[3, ]
```
We are 95% confident that the estimated average fev value increases between 0.10 and 0.12 liters for each increase of one inch in patient's height holding age constant. Conclude that there is a positive relationship between height and fev.
Do not interpret the confidence interval for the intercept.
## m3: Analysis of variance table
```{r m3-anova}
anova(m3)
```
The sum of squares total (SST) is 280.9 + 95.3 + 114.7 = 490.9. Only a small portion of the variation (114.7) is unexplained variation. The F-ratio is much larger than 1, and the p-value is less than 0.001. You can conclude that the combination of age and height helps significantly in predicting fev.
## m3: R-squared
```{r m3-r-squared}
glance(m3)$r.squared
```
Roughly 77% of the variation in fev can be explained by the combination of the age and height of the patients.
## m3: t-tests
```{r}
tidy(m3)
```
# A tibble: 15 × 9
fev age height .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.16 7 47 0.926 0.239 0.0148 0.420 0.00165 0.574
2 2.91 18 66 3.61 -0.702 0.0200 0.419 0.0194 -1.69
3 5.10 19 72 4.32 0.782 0.0171 0.419 0.0205 1.88
4 3.52 19 66 3.66 -0.143 0.0262 0.420 0.00107 -0.345
5 3.34 19 65.5 3.61 -0.262 0.0274 0.420 0.00376 -0.633
6 3.08 18 64.5 3.44 -0.361 0.0231 0.420 0.00598 -0.870
7 2.90 16 63 3.17 -0.267 0.0150 0.420 0.00208 -0.641
8 4.22 18 68 3.83 0.393 0.0167 0.420 0.00506 0.944
9 3.5 17 62 3.11 0.386 0.0228 0.420 0.00672 0.929
10 2.61 16 62 3.06 -0.452 0.0170 0.420 0.00679 -1.09
11 4.09 18 67 3.72 0.369 0.0183 0.420 0.00487 0.887
12 4.40 18 70.5 4.10 0.303 0.0141 0.420 0.00251 0.726
13 2.28 15 60 2.79 -0.508 0.0160 0.420 0.00810 -1.22
14 2.85 18 60 2.95 -0.0963 0.0359 0.420 0.000678 -0.234
15 2.80 16 63 3.17 -0.375 0.0150 0.420 0.00410 -0.900
# A tibble: 7 × 9
fev age height .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.72 8 67.5 3.23 -1.51 0.0131 0.416 0.0578 -3.61
2 5.22 12 70 3.72 1.50 0.00637 0.416 0.0276 3.59
3 2.54 14 71 3.94 -1.40 0.00610 0.416 0.0229 -3.35
4 2.22 13 68 3.56 -1.34 0.00377 0.417 0.0129 -3.20
5 5.79 15 69 3.77 2.02 0.00604 0.412 0.0472 4.83
6 5.63 17 73 4.32 1.31 0.0104 0.417 0.0347 3.14
7 5.64 17 70 3.99 1.65 0.0108 0.415 0.0565 3.94
In the pulmonary database, no combination of high leverage and extreme studentized residuals is going to cause concern.
Note: interpreting influential values gets tricky with two independent variables.
## m3: Diagnostic plots, 1
```{r diagnostic-1}
r3 <- augment(m3)
r3 |>
ggplot(aes(sample=.resid)) +
stat_qq() +
ggtitle("Graph drawn by Steve Simon on 2024-09-30") +
```
## m3: Diagnostic plots, 2
```{r diagnostic-2}
r3 |>
ggplot(aes(.resid)) +
geom_histogram(
binwidth=0.2,
color="black",
fill="white") +
xlab("Residuals from m3 regression") +
ggtitle("Graph drawn by Steve Simon on 2024-09-25")
```
## m3: Diagnostic plots, 3
```{r diagnostic-3}
r3 |>
ggplot(aes(.fitted, .resid)) +
geom_point() +
xlab("Predicted values from m3 regression") +
ylab("Residuals from m3 regression") +
ggtitle("Graph drawn by Steve Simon on 2024-09-25")
```
## m3: Influential data points, 1
```{r influence-1}
n <- nrow(r3)
r3 |> filter(.hat > 3*3/n)
```
## m3: Influential data points, 2
```{r influence-2}
r3 |>
filter(abs(.std.resid) > 3)
```
## m3: Influential data points, 3
```{r influence-3}
r3 |>
filter(.cooksd > 1)
```
## m3: Variance inflation factor
```{r vif}
library(car)
vif(m3)
```
---
title: "Directions for 5501-07 programming assignment"
format:
html:
embed-resources: true
date: 2024-09-27
---
This file was written by Steve Simon on 2024-09-27 and is placed in the public domain.
## Program
- Download [simon-5501-07-fev.qmd][tem]
- Store it in your src folder
- Modify the file name
- Use your last name instead of "simon"
- Modify the documentation header
- Add your name to the author field
- Optional: change the copyright statement
[tem]: https://github.com/pmean/classes/blob/master/biostats-1/07/src/simon-5501-07-fev.qmd
## Data
- Download [breast-feeding-preterm.csv][dat]
- Store it in your data folder
- Refer to the [data dictionary][dic] if needed
[dat]: https://github.com/pmean/datasets/blob/master/breast-feeding-preterm.csv
[dic]: https://github.com/pmean/datasets/blob/master/breast-feeding-preterm.yaml
## Question 1
Change the program so that it reads in the breast-feeding-preterm.csv file. Show a glimpse of the data and verify that you have properly read in all 82 rows and 31 columns. No intepretation is necessary for this question.
## Question 2
Compute descriptive statistics (counts and percentages) for feed_type. Interpret these values.
## Question 3
Compute descriptive statistics (mean, standard deviation, minimum, and maximum) for age_stop. Interpret these values.
## Question 4
Draw a boxplot comparing age_stop for each level of feed_type. Interpret this plot.
## Question 5
Calculate the means and standard deviations for each level of feed_type. Interpret these numbers.
## Question 6
Compute a linear regression model predicting age_stop using feed_type. What value does R assign to 0 and what value does R assign to 1? Interpret the slope and intercept for this linear regression model.
## Question 7
Compute R-squared for this regression model. Interpret this number.
## Question 8
Draw a normal probability plot and a histogram for the residuals from this regression model. Is the assumption of normality satisfied?
## Question 9
Calculate descriptive statistics (mean, standard deviation, minimum, and maximum) for mom_age and para. Interpret these values.
## Question 10
Calculate the correlations between mom_age, para, and age_stop. Interpret these values.
## Question 11
Draw a scatterplot with mom_age on the x-axis and age_stop on the y-axis. Repeat this with para on the x-axis. Interpret these plots.
## Question 12
Compute a linear regression model using mom_age and para to predict age_stop. Interpret the regression coefficients.
## Question 13
Compute R-squared for this regression model. Interpret this number.
## Question 14
Draw a normal probability plot and a histogram of the residuals. Interpret these plots.
## Question 15
Draw a plot with the predicted values on the x-axis and the residuals on the y-axis. Is there any evidence of heterogeneity or non-linearity?
## Question 16
Display any extreme values for leverage (greater than 3*2/n), studentized deleted residuals (absolute value greater than 3), and for Cook's distance (greater than 1). Explain why these values are extreme.
## Your submission
- Save the output in html format
- Convert it to pdf format.
- Make sure that the pdf file includes
- Your last name
- The number of this course
- The number of this module
- Upload the file
## If it doesn't work
If your program has any errors or fails
to produce the output that you desire
and you can't resolve the problem,
upload the program file along with the
pdf file to help us figure out what
went wrong. You will get a chance to
resubmit the assignment if needed.